Towards an integrated representation of multiple layers of linguistic annotation in multilingual corpora
Abstract
Recent years have seen increasing interest in enriching natural language corpora with explicit linguistic annotation. This interest is most prominent in two areas of linguistics: corpus linguistics and computational linguistics. In corpus linguistics, the long-standing practice has been to work on raw, i.e., unannotated, text. Raw corpora are adequate for some kinds of linguistic work, notably lexicology and lexicography. For other analysis tasks, such as syntactic or semantic analysis, the information to be extracted is not readily derivable from raw text. Corpora therefore have to be annotated with linguistic categories before the desired kinds of information can be extracted. For such annotation to be practicable at all, it needs to be carried out automatically or at least semi-automatically.
Similar resources
Towards a new level of annotation detail of multilingual speech corpora
The aim of this paper is to highlight the actual need for corpora that have been annotated based on acoustic information. The acoustic information should be coded in features or properties and is needed to inform further processing systems, i.e. to present a basis for a speech recognition system using linguistic information. Feature annotation of existing corpora in combination with segmental a...
Linguistically Annotated Corpus as an Invaluable Resource for Advancements in Linguistic Research: A Case Study
A case study based on experience in linguistic investigations using annotated monolingual and multilingual text corpora; the “cases” include a description of language phenomena belonging to different layers of the language system: morphology, surface and underlying syntax, and discourse. The analysis is based on a complex annotation of syntax, semantic functions, information structure and disco...
ANNIS3: A new architecture for generic corpus query and visualization
This paper is concerned with the data structures, properties of query languages and visualization facilities required for the generic representation of richly annotated, heterogeneous linguistic corpora. We propose that above and beyond a general graph based data-model, which is becoming increasingly popular in many complex annotation formats, a well-defined concept of multiple, potentially con...
SusTEInability of linguistic resources through feature structures
This article shows that the TEI tag set for feature structures can be adopted to represent a heterogeneous set of linguistic corpora. The majority of corpora is annotated using markup languages that are based on the Annotation Graph framework, the upcoming Linguistic Annotation Format ISO standard, or according to tag sets defined by or based upon the TEI guidelines. A unified representation co...